Abstract. The term "nexting" has been used by psychologists to refer
to the propensity of people and many other animals to continually
predict what will happen next in an immediate, local, and personal sense.
The ability to "next" constitutes a basic kind of awareness and
knowledge of one's environment. In this paper we present results with a robot
that learns to next in real time, predicting thousands of features of the
world's state, including all sensory inputs, at timescales from 0.1 to 8
seconds. This was achieved by treating each state feature as a reward-like
target and applying temporal-difference methods to learn a corresponding
value function with a discount rate corresponding to the timescale.
We show that two thousand predictions, each dependent on six thousand
state features, can be learned and updated online at better than 10 Hz
on a laptop computer, using the standard TD(λ) algorithm with linear
function approximation. We show that this approach is efficient enough 
to be practical, with most of the learning complete within 30 minutes. We 
also show that a single tile-coded feature representation suffices to accu- 
rately predict many different signals at a significant range of timescales. 
Finally, we show that the accuracy of our learned predictions compares 
favorably with the optimal off-line solution. 

1 Multi-timescale Nexting 

Psychologists have noted that people and other animals seem to continually 
make large numbers of short-term predictions about their sensory input (e.g., 
see Gilbert 2006, Brogden 1939, Pezzulo 2008, Carlsson et al. 2000). When we
hear a melody we predict what the next note will be or when the next downbeat
will occur, and are surprised and interested (or annoyed) when our predictions
are disconfirmed (Huron 2006, Levitin 2006). When we see a bird in flight, hear
our own footsteps, or handle an object, we continually make and confirm multiple
predictions about our sensory input. When we ride a bike, ski, or rollerblade, we
have finely tuned moment-by-moment predictions of whether we will fall, and of
how our trajectory will change in a turn. In all these examples, we continually 
predict what will happen to us next. Making predictions of this simple, personal, 
short-term kind has been called nexting (Gilbert, 2006). 

Nexting predictions are specific to one individual and to their personal,
immediate sensory signals or state variables. A special name for these predictions
seems appropriate because they are unlike predictions of the stock market, of
political events, or of fashion trends. Predictions of such public events seem to
involve more cognition and deliberation, and are fewer in number. In nexting
we envision that one individual may be continually making massive numbers of
small predictions in parallel. Moreover, nexting predictions seem to be made
simultaneously at multiple time scales. When we read, for example, it seems likely
that we next at the letter, word, and sentence levels, each involving substantially 
different time scales. 

The ability to predict and anticipate has often been proposed as a key part of 
intelligence (e.g., see Tolman 1951, Hawkins & Blakeslee 2004, Butz et al. 2003, 
Wolpert et al. 1995, Clark in press). Nexting can be seen as the most basic kind 
of prediction, preceding and possibly underlying all the others. That people and 
a wide variety of animals learn and make simple predictions at a range of short 
time scales in conditioning experiments was established so long ago that it is 
known as classical conditioning (Pavlov 1927). Predictions of upcoming shock 
to a paw may reveal themselves in limb-retraction attempts a fraction of a second 
before the shock, and as increases in heart rate 30 seconds prior. In other
experiments, for example those known as sensory preconditioning (Brogden 1939,
Rescorla 1980), it has been clearly shown that animals learn predictive relation- 
ships between stimuli even when none of them are inherently good or bad (like 
food and shock) or connected to an innate response. In this case the predictions 
are made, but not expressed in behaviour until some later experimental manip- 
ulation connects them to a response. Animals seem to just be wired to learn the 
many predictive relationships in their world. 

To be able to next is to have a basic kind of knowledge about how the world 
works in interaction with one's body. It is to have a limited form of forward model 
of the world's dynamics. To be able to learn to next (to notice any disconfirmed
predictions and continually adjust your nexting) is to be aware of one's world
in a significant way. Thus, to build a robot that can do both of these things is a 
natural goal for artificial intelligence. Prior attempts to achieve artificial nexting 
can be grouped in two approaches. 

The first approach is to build a myopic forward model of the world's dy- 
namics, either in terms of differential equations or state-transition probabilities 
(e.g., Wolpert et al. 1995, Grush 2004, Sutton 1990). In this approach a small
number of carefully chosen predictions are made of selected state variables with
a public meaning. The model is myopic in that the predictions are only short
term, either infinitesimally short in the case of differential equations, or
maximally short in the case of the one-step predictions of Markov models. In these
ways, this approach has ended up in practice being very different from nexting. 

The second approach, which we follow here, is to use temporal-difference 
(TD) methods to learn long-term predictions directly. The prior work pursuing 
this approach has almost all been in simulation, and has used table-lookup repre- 
sentations and a small number of predictions (e.g., Sutton 1995, Kaelbling 1993, 
Singh 1992, Sutton, Precup & Singh 1999, Dayan and Hinton 1993). Sutton et 
al. (2011) showed real-time learning of TD predictions on a robot, but did not
demonstrate the ability to learn many predictions in real time or with a single
feature representation. 

2 Nexting as Multiple Value Functions 

We take a reinforcement-learning approach to achieving nexting. In reinforcement
learning it is commonplace to learn long-term predictions of reward, called
value functions, and to learn these using temporal-difference (TD) methods such
as TD(λ) (Sutton 1988). However, TD(λ) has also been used as a model of classical
conditioning, where the predictions are shorter term and where more than
one signal might be viewed as a reward (Sutton & Barto, 1990). Our approach to 
nexting can be seen as taking this latter approach to the extreme of predicting 
massive numbers of target signals of all kinds at multiple time scales. 

We use a notation for our multiple predictions that mirrors (or rather
multiplies) that used for conventional value functions. Time is taken to be
discrete, $t = 1, 2, 3, \ldots$, with each time step corresponding to approximately 0.1
seconds of real time. Our $i$th prediction at time $t$, denoted $v_t^i$, is meant to
anticipate the future values of the $i$th prediction's target signal, $r_t^i$, over a designated
time scale given by the discount-rate parameter $\gamma^i$. In our experiments, the
target signal $r_t^i$ was either a raw sensory signal or else a component of a
state-feature vector (that we will introduce shortly), and the discount-rate parameter
was one of four fixed values. The goal of learning is for each prediction to
approximately equal the correspondingly discounted sum of the future values of
the corresponding target signal:

$$v_t^i \approx \sum_{k=0}^{\infty} (\gamma^i)^k \, r_{t+k+1}^i = G_t^i. \qquad (1)$$

The random quantity $G_t^i$ is known as the return.
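
As a concrete illustration of the return, the following minimal sketch computes it for every step of a logged signal. It is not the authors' code: the function name `discounted_returns` is hypothetical, and truncating the sum at the end of the log is an assumption made here because a finite log cannot supply the infinite sum in the definition.

```python
def discounted_returns(signal, gamma):
    """Truncated return for every step t: sum over k of gamma**k * signal[t + k + 1].

    `signal[t]` is the target signal observed at step t. Terms beyond the end
    of the log are dropped (an assumption; the definition uses an infinite sum).
    """
    returns = [0.0] * len(signal)
    running = 0.0
    # Sweep backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for t in range(len(signal) - 2, -1, -1):
        running = signal[t + 1] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A signal that switches on after the first step, discounted with gamma = 0.5.
    print(discounted_returns([0.0, 1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0, 0.0]
```

The backward sweep costs one multiply-add per step, so returns for a long log can be computed in a single pass.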

We use linear function approximation to form each prediction. That is, we
assume that the state of the world at time $t$ is characterized by the feature vector
$\phi_t \in \mathbb{R}^n$, and that all the predictions $v_t^i$ are formed as inner products of $\phi_t$ with
the corresponding weight vectors $\theta_t^i$:

$$v_t^i = \phi_t^\top \theta_t^i = \sum_{j=1}^{n} \phi_t(j) \, \theta_t^i(j), \qquad (2)$$

where $\phi_t^\top$ denotes the transpose of $\phi_t$ (all vectors are column vectors unless
transposed) and $\phi_t(j)$ denotes its $j$th component. The predictions at each time
are thus determined by the weight vectors $\theta_t^i$. One natural algorithm for learning
the weight vectors is linear TD(λ):

$$\theta_{t+1}^i = \theta_t^i + \alpha \left( r_{t+1}^i + \gamma^i \phi_{t+1}^\top \theta_t^i - \phi_t^\top \theta_t^i \right) e_t^i, \qquad (3)$$

where $\alpha > 0$ is a step-size parameter and $e_t^i \in \mathbb{R}^n$ is an eligibility trace vector,
initially set to zero and then updated on each step by

$$e_t^i = \gamma^i \lambda \, e_{t-1}^i + \phi_t, \qquad (4)$$

where $\lambda \in [0, 1]$ is a trace-decay parameter.
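
The update described above can be sketched for a single prediction as follows. This is a minimal illustration in plain Python rather than the Java implementation used in the experiments; the class name `LinearTDLambda` and its method names are hypothetical, and the features are assumed to be binary and passed as a list of active indices.

```python
class LinearTDLambda:
    """Linear TD(lambda) for one prediction over sparse binary features."""

    def __init__(self, num_features, gamma, lam, alpha):
        self.gamma, self.lam, self.alpha = gamma, lam, alpha
        self.theta = [0.0] * num_features   # weight vector
        self.trace = [0.0] * num_features   # eligibility trace, starts at zero

    def predict(self, active):
        # With binary features the inner product is a sum over active indices.
        return sum(self.theta[j] for j in active)

    def update(self, active, target_next, active_next):
        """One step: `active` = phi_t, `target_next` = r_{t+1}, `active_next` = phi_{t+1}."""
        # Trace update: e_t = gamma * lambda * e_{t-1} + phi_t.
        decay = self.gamma * self.lam
        self.trace = [decay * e for e in self.trace]
        for j in active:
            self.trace[j] += 1.0
        # TD error: r_{t+1} + gamma * v(phi_{t+1}) - v(phi_t), using the current weights.
        delta = target_next + self.gamma * self.predict(active_next) - self.predict(active)
        # Weight update: theta_{t+1} = theta_t + alpha * delta * e_t.
        step = self.alpha * delta
        self.theta = [w + step * e for w, e in zip(self.theta, self.trace)]
        return delta
```

As a sanity check, a learner with a single always-active feature and a constant target of 1 should approach the value 1 / (1 - gamma), since that is the discounted sum of a constant signal.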

Under common assumptions and a decreasing step-size parameter, TD(λ)
with $\lambda = 1$ converges asymptotically to the weight vector that minimizes the
mean squared error between the prediction and its return. In practice, smaller
values of $\lambda \in [0, 1)$ are almost always used because they can result in significantly
faster learning (e.g., see Sutton & Barto 1998), but the $\lambda = 1$ case still provides
an important theoretical touchstone. In this case we can define an optimal weight
value $\theta_*^i$ that minimizes the squared error from the return over the first $N$
predictions:

$$\theta_*^i = \arg\min_{\theta} \sum_{t=1}^{N} \left( \phi_t^\top \theta - G_t^i \right)^2. \qquad (5)$$

This value can be computed offline by standard algorithms for solving large
least-squares regression problems, and the performance of this offline-optimal value
can be compared with that of the weight vectors found online by TD(λ). The
offline algorithm is $O(n^3)$ in computation and $O(n^2)$ in memory, and thus is just
barely tractable for the cases we consider here, in which $n = 6065$. Nevertheless,
$\theta_*^i$ provides an important performance standard in that it provides an upper
limit on one measure of the quality of the predictions found by learning. This
upper limit is determined not by any learning algorithm, but by the feature
representation. As we will see, even the predictions due to $\theta_*^i$ will have residual
error. Thus, this analysis provides a method for determining when performance 
can be improved with more experience and when performance improvements 
require a better representation. Note that this technique is applicable even when 
experience is gathered from the physical world, where no formal notion of state 
is available. 
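
To make the offline standard concrete, here is a small sketch of a least-squares solve via the normal equations, which has the cubic time and quadratic memory cost mentioned above. It is only an illustration under stated assumptions: the function name `offline_optimal_weights` is hypothetical, the tiny `ridge` term is added here purely to keep the matrix invertible when features are collinear, and a practical solver for thousands of features would use an optimized linear-algebra library instead.

```python
def offline_optimal_weights(features, returns, ridge=1e-9):
    """Least-squares weights minimizing sum_t (phi_t . theta - G_t)^2.

    Builds the normal equations A theta = b with A = sum phi phi^T (n x n, so
    O(n^2) memory) and solves them by Gaussian elimination (O(n^3) time).
    `ridge` adds a tiny diagonal term so A stays invertible; it is an
    assumption of this sketch, not part of the definition.
    """
    n = len(features[0])
    A = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for phi, g in zip(features, returns):
        for i in range(n):
            if phi[i] == 0.0:
                continue
            b[i] += phi[i] * g
            for j in range(n):
                A[i][j] += phi[i] * phi[j]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(A[i][k] * theta[k] for k in range(i + 1, n))
        theta[i] = (b[i] - tail) / A[i][i]
    return theta
```

Any residual error of the resulting predictions on the same data reflects the limits of the features rather than of a learning algorithm, which is the role the offline solution plays in the comparison.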

3 Experimental Setup 

We investigated the practicality of nexting on the Critterbot, a custom-designed 
robust and sensor-rich mobile robot platform (Figure 1, left). The robot has a
diverse set of sensors and has holonomic motion provided by three omni-wheels. 
Sensors attached to the motors report the electrical current, the input motor 
voltage, motor temperature, wheel rotational velocities, and an overheating flag, 
providing substantial observability of the internal physical state of the robot. 
Other sensors collect information from the external environment. Passive sen- 
sors detect ambient light in several directions from the top of the robot in the 
visible and infrared spectrum. Active sensors emit infrared light and measure 
the reflectance, providing information about the distance to nearby obstacles. 
Other sensors report acceleration, rotation, and the magnetic field. In total, we 
consider 53 different sensor readings, all normalized to values between 0 and 1
based on sensor limits. 

For our experiments, the agent's state representation was a binary vector,
$\phi_t \in \{0, 1\}^n$, with a constant number of 1 features, constructed by tile coding
(see Sutton & Barto 1998). The features provided no history and performed no
averaging of sensor values. The sensory signals were partitioned based on sensor
modalities. Within each sensor modality, each individual sensor (e.g., Light0) has
multiple overlapping tilings at random offsets (up to 8 tilings), where each tiling
splits the sensor range into disjoint intervals of fixed width (up to 8 intervals).
Additionally, pairs of sensors within a sensor modality were tiled together using
multiple two-dimensional overlapping grids. Pairs of sensors were jointly tiled if
they were spatially adjacent on the robot (e.g., IRLight0 with IRLight1) or if
there was a single sensor in between them (e.g., IRDistance1 with IRDistance3,
IRDistance2 with IRDistance4, etc.). All in all, this tiling scheme produced a
feature vector with $n = 6065$ components, most of which were 0s, but exactly
457 of which were 1s, including one bias feature that was always 1.

Fig. 1. Left: The Critterbot, a custom mobile robot with multiple sensors. Right:
The Critterbot gathering experience while wall-following in its pen. This experience
contains observations of both stochastic events (such as ambient light variations from
the sun) and regular events (such as passing a lamp on the lower-left side of the pen).
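
The one-dimensional part of such a tiling scheme can be sketched as below. This is a simplified illustration, not the representation used on the robot: the function names, the fixed random seed, and the choice of one extra tile per tiling to cover the shifted range are all assumptions of the sketch, and the two-dimensional joint tilings are omitted.

```python
import random


def make_offsets(num_tilings, num_intervals, seed=0):
    """Random offset for each tiling, drawn from [0, interval width]."""
    rng = random.Random(seed)
    width = 1.0 / num_intervals
    return [rng.uniform(0.0, width) for _ in range(num_tilings)]


def active_tiles(value, offsets, num_intervals):
    """Indices of the active binary features for one sensor value in [0, 1].

    Each tiling contributes exactly one active tile. A tiling has
    num_intervals + 1 tiles so that the shifted range is fully covered.
    """
    width = 1.0 / num_intervals
    tiles_per_tiling = num_intervals + 1
    active = []
    for k, offset in enumerate(offsets):
        index = int((value + offset) / width)
        index = min(index, tiles_per_tiling - 1)  # guard against rounding at the top edge
        active.append(k * tiles_per_tiling + index)
    return active
```

Because every tiling activates exactly one tile, the number of 1 features is the same for every sensor value, which matches the constant count of active features described above.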

The robot experiment was conducted in a square wooden pen, approximately 
two meters on a side, with a lamp on one edge (see Figure 1). The robot's actions
were selected according to a fixed stochastic wall-following policy. This policy 
moved forward by default, slid left or right to keep a side IRDistance sensor 
within a bounded range (50-200), and drove backward while turning when the 
front IRDistance sensor reported a nearby obstacle. The robot completed a loop 
of the pen approximately once every 40 seconds. Due to overheating protection, 
the motors would stop to cool down at approximately 14 minute intervals. To 
increase the diversity of the data, the policy selected an action at random with 
a probability p = 0.05. At every time step (approximately 100ms), sensory data 
was gathered and an action performed. This simple policy was sufficient for the 
robot to reliably follow the wall for hours, even with overheating interruptions. 
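
The structure of such a policy can be sketched as follows. Several details are assumptions made only for illustration and are not given in the text: the action names, the convention that larger IR readings mean a closer surface, the wall being on the robot's left, and the `front_near` threshold. Only the side range of 50 to 200 and the random-action probability of 0.05 come from the description above.

```python
import random

ACTIONS = ("forward", "slide_left", "slide_right", "back_and_turn")  # hypothetical names


def wall_following_action(side_ir, front_ir, rng=random,
                          side_low=50, side_high=200,
                          front_near=150, epsilon=0.05):
    """Choose an action for one 100 ms step.

    Assumptions of this sketch: larger IR readings mean a closer surface,
    the wall is on the robot's left, and `front_near` is a made-up threshold.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)          # exploratory random action
    if front_ir > front_near:
        return "back_and_turn"              # obstacle ahead
    if side_ir > side_high:
        return "slide_right"                # too close to the wall, move away
    if side_ir < side_low:
        return "slide_left"                 # too far from the wall, move closer
    return "forward"                        # default behaviour
```

Passing a seeded `random.Random` instance as `rng` makes a run of the policy reproducible.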

The wall-following policy, tile-coding, and the TD(λ) learning algorithm were
all implemented in Java and run on a laptop connected to the robot by a
dedicated wireless link. The laptop used an Intel Core 2 Duo processor with a 2.4 GHz
clock, 3 MB of shared L3 cache, and 4 GB DDR3 RAM. The system garbage
collector was called on every time step to reduce variability. Four threads were
used for the learning code. For offline analysis, data was also logged to disk for
120000 time steps (3 hours and 20 minutes).



4 Results 

We applied TD(λ) to learn 2160 predictions in parallel. The first 212 predictions
had the target signal, $r_t^i$, set to the sensor reading of one of the 53 sensors and the
discount rate, $\gamma^i$, set to one of four timescales; the remaining 1948 predictions
had the target signal set to one of 487 randomly selected components of the
feature vector and the discount rate set to one of four timescales. The discount
rates were one of the four values in {0, 0.8, 0.95, 0.9875}, corresponding to time
scales of approximately 0.1, 0.5, 2, and 8 seconds respectively. The learning
parameters were $\lambda = 0.9$ and $\alpha = 0.1/457$ (457 being the number of active
features). The initial weight vector was set to zero.
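
The bookkeeping in this paragraph can be checked with a short sketch. The constant and function names are hypothetical, and the rule that a discount rate implies a horizon of 1 / (1 - gamma) time steps is an assumption; it is used here because it reproduces the four stated time scales at 0.1 seconds per step.

```python
STEP_SECONDS = 0.1                       # one time step, as stated in the text
DISCOUNT_RATES = (0.0, 0.8, 0.95, 0.9875)
NUM_SENSORS = 53
NUM_FEATURE_TARGETS = 487
NUM_ACTIVE_FEATURES = 457

LAMBDA = 0.9
ALPHA = 0.1 / NUM_ACTIVE_FEATURES        # step size scaled by the active-feature count


def timescale_seconds(gamma, step_seconds=STEP_SECONDS):
    """Horizon implied by a discount rate, assuming the 1 / (1 - gamma) rule."""
    return step_seconds / (1.0 - gamma)


def build_prediction_specs():
    """One (target kind, target index, discount rate) triple per prediction."""
    specs = [("sensor", s, g) for s in range(NUM_SENSORS) for g in DISCOUNT_RATES]
    specs += [("feature", f, g) for f in range(NUM_FEATURE_TARGETS) for g in DISCOUNT_RATES]
    return specs


if __name__ == "__main__":
    specs = build_prediction_specs()
    print(len(specs))                                                # 2160
    print([round(timescale_seconds(g), 3) for g in DISCOUNT_RATES])  # [0.1, 0.5, 2.0, 8.0]
```

The counts work out as 53 x 4 = 212 sensor predictions plus 487 x 4 = 1948 feature predictions, for 2160 in total.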

Our initial performance question was scalability, in particular, whether so 
many predictions could be made and learned in real time. We found that the 
total computation time for a cycle under our conditions was 55ms, well within the 
100ms duty cycle of the robot. The total memory consumption was 400MB. Note 
that with faster computers the number of predictions or the size of the weight 
and feature vectors could be increased at least proportionally. This strategy 
for nexting should be easily scalable to millions of predictions with foreseeable 
increases in parallel computing power over the next decade. 

Fig. 2. Nexting is demonstrated in these graphs with predictions that rise and fall
prior to the increase and decrease of a sensory signal. Comparison of ideal (left) and
learned (right) predictions of one of the light sensors for three trips around the pen after
2.5 hours of experience. On each trip, the sensor value saturates at 1.0. The returns
for the 2 and 8-second predictions, shown on the left, rise in anticipation of the high
value, and then fall in anticipation of the low value. The 8-second predictions in the
second panel of the offline-optimal weights (dotted blue line) and the TD(λ)-learned
weights (solid red line) behave similarly both to each other and to the returns (albeit
with more noise).



For an initial assessment of accuracy, let us take a close look at one of the
predictions, in particular, at the prediction for one of the light sensors. Notice
that there is a bright lamp in the lower left corner of the pen in Figure 1 (right).
On each trip around the pen, the light sensor increases to its maximal level and
then falls back to a low level, as shown by the black line in Figure 2. If the state
features are sufficiently informative, then the robot may be able to anticipate
the rising and falling of this sensor value. The ideal prediction is the return $G_t^i$,
shown on the left in the colored lines in Figure 2 for two time scales (two seconds
and eight seconds). Of course, to determine these lines, we had to use the future
values of the light sensor; the idea here is to approximate these ideal predictions
(as in Equation 5) using only the sensory information available to the robot in its
feature vector. The second panel of the figure shows the predictions due to the
weight vector adapted online by TD(λ) and due to the optimal weight vector,
$\theta_*^i$, computed offline (both for the 8-second time scale). The key result is that
the robot has learned to anticipate both the rise and fall of the light. Both the
learned prediction and the optimal offline prediction match the return closely,
though with substantial noisy perturbations.

Figure 3 is a still closer look at this same prediction, obtained by
averaging over 100 circuits around the pen, aligning each circuit's data so
that the time of initial saturation of the light sensor is the same. We can
now see very clearly how the predictions and returns anticipate both
the rise and fall of the sensor value, and that both the TD(λ) prediction
and the optimal prediction, when averaged, closely match the return.

Having demonstrated that accurate prediction is possible, we now
consider the rate of learning in Figure 4. The graphs show that learning
is fast in terms of data (despite the large number of features), converging
to solutions with low error in the familiar exponential way. This result is
important as it demonstrates that learning online in real time is possible on robots
with a few hours of experience, even with a large distributed representation. For
contrast, we also show the learning curve for a trivial representation consisting
only of a bias unit (the single feature that is always 1). The comparison serves
to highlight that large informative feature sets are beneficial. The comparison
to the predictive performance of the offline-optimal solution shows a vanishing
performance gap by the end of the experiment. The second panel of the figure
shows a similar pattern of decreasing errors for a sample of the 2160 TD(λ)
predictions, showing that learning many predictions in parallel yields similar
results. A noteworthy result is that the same learning parameters and
representation suffice for learning answers to a wide variety of nexting predictions
without any convergence problems. Although the answers continue to improve
over time, the most dramatic gains were achieved within the first 30 minutes of
real time.

Fig. 3. An average of 100 cycles like the three shown in Figure 2 (right panel),
aligned on the onset of sensor saturation. Error bars are slightly wider than the lines
themselves and overlap substantially, so are omitted for clarity.

Fig. 4. Nexting learning curves for the 8-second light sensor predictions (left) and for
a representative sample of the TD(λ) predictions (right). Predictions at different time
scales have had their root mean squared error (RMSE) normalized by [symbol missing]. The graph
on the left is a comparison of different learning algorithms. The jog in the middle of
the first graph occurs when the robot stops by the light to cool off its motors, causing
the online learners to start making poor predictions. In spite of the unusual event, the
TD(λ) solution still approaches the offline-optimal solution. TD(λ) performs similarly
to a supervised learner TD(1), and they both slightly outperform TD(0). The curve
for the bias unit shows the poor performance of a learner with a trivial representation.
The graph on the right shows that seemingly all the TD(λ) predictions are learning
well with a single feature representation and a single set of learning parameters.

5 Discussion 

These results provide evidence that online learning of thousands of nexting pre- 
dictions on a robot in parallel is possible, practical, and accurate. Moreover, the 
predictive accuracy is reasonable with just a few hours of robot experience, no 
tuning of algorithm parameters, and using a single feature representation for all 
predictions. The parallel scalability of knowledge-acquisition in this approach 
is substantially novel when compared with the predominately sequential exist- 
ing approaches common for robot learning. These results also show that online 
methods can be competitive in accuracy with an offline optimization of mean 
squared error. 

The ease with which a simple reinforcement learning algorithm enables nex- 
ting on a robot is somewhat surprising. Although the formal theories of re- 
inforcement learning sometimes give mathematical guarantees of convergence, 
there is little guidance for the choice of features for a task, for selecting learn- 
ing parameters across a range of tasks, or for how much experience is required 
before a reinforcement learning system will approach convergence. The
experiments show that we can use the same features across a range of tasks, anticipate
events before they occur, and achieve predictive accuracy approaching that of 
an offline-optimal solution with a limited amount of robot experience. 

6 More General Nexting 

The exponentially discounted predictions that we have focused on in this paper 
constitute the simplest kind of nexting. They are a natural first kind of predictive 
knowledge to be learned. Online TD-style algorithms can be extended to handle 
a much broader set of predictions, including time-varying choices of $\gamma$,
time-varying $\lambda$, and even off-policy prediction (Maei & Sutton 2010). It has even
been proposed that all world knowledge can be represented by a sufficiently 
large and diverse set of predictions (Sutton 2009). 

As one example of such an extension, consider allowing the discount
rate $\gamma^i$ to vary as a function of the agent's state. The algorithmic
modifications required are straightforward. In the definition of the return in
Equation 1, $(\gamma^i)^k$ is replaced with $\prod_{j=1}^{k} \gamma_{t+j}^i$ (the empty
product for $k = 0$ being 1). In Equation 3, $\gamma^i$ is replaced with $\gamma_{t+1}^i$,
and finally, in Equation 4, $\gamma^i$ is replaced with $\gamma_t^i$. Using
the modified definitions, the robot can predict how much motor power it will
consume until either the light sensor is saturated or approximately two
seconds elapse. This prediction can be formalized by setting the prediction's
target signal to be the sum of instantaneous power consumption of each
wheel ($r = \sum_{i=1}^{3} \text{MotorVoltage}_i \times \text{MotorCurrent}_i$) and
throttling gamma when the light sensor is saturated ($\gamma_t = 0.1$ when the light
sensor is saturated and 0.95 otherwise). The plots in Figure 5
show that the robot has learned to anticipate how much power will be expended
prior to reaching the light or spontaneously terminating.
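
A minimal sketch of a TD(λ) step with a state-dependent discount follows. The function names are hypothetical, the features are assumed binary and given as active indices, and detecting saturation as a normalized light reading of at least 1.0 is an assumption; the two discount values 0.1 and 0.95 are the ones quoted above.

```python
def gamma_for(light, saturated_at=1.0, gamma_saturated=0.1, gamma_default=0.95):
    """State-dependent discount: small once the light sensor saturates."""
    return gamma_saturated if light >= saturated_at else gamma_default


def td_lambda_step(theta, trace, active, target_next, active_next,
                   gamma_t, gamma_next, lam, alpha):
    """One linear TD(lambda) step with state-dependent discounting, in place.

    The trace decays with gamma_t (the discount of the current state) and the
    TD error bootstraps with gamma_next (the discount of the next state).
    """
    for j in range(len(trace)):
        trace[j] *= gamma_t * lam
    for j in active:
        trace[j] += 1.0
    v_now = sum(theta[j] for j in active)
    v_next = sum(theta[j] for j in active_next)
    delta = target_next + gamma_next * v_next - v_now
    for j in range(len(theta)):
        theta[j] += alpha * delta * trace[j]
    return delta
```

With a constant discount passed for both `gamma_t` and `gamma_next`, the step reduces to the ordinary fixed-discount update, which gives a simple way to check the code.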

Fig. 5. Nexting can be extended, for example to consider time-varying gamma to
predict the amount of power that the robot will expend before a probabilistic
pseudo-termination with a 2-second time horizon or a saturation event on the light
sensor. (Plot axis label: Seconds.)

7 Conclusions

We have demonstrated multi-timescale nexting on a physical robot: thousands
of anticipatory predictions at various time scales can be learned in parallel
in real time using a reinforcement learning methodology. This
approach uses a large feature representation with an online learning algorithm
to provide an efficient means for making parallel predictions. The algorithms are
capable of making real-time predictions about the future of the robot's sensors
at multiple time scales using the computational horsepower of a laptop. Finally,
and key to the practical application of our approach, we have shown that a single
feature representation and a single set of learning parameters are sufficient for
learning many diverse predictions. A natural direction for future work would be
to extend these results to more general predictions and to control.

Acknowledgements

The authors thank Mike Sokolski for creating the Critterbot and Patrick
Pilarski and Thomas Degris for preparation of Figure 1 and for essential assistance
with the experiment briefly reported in Section 6. This work was supported by
grants from Alberta Innovates - Technology Futures, the Natural Sciences and
Engineering Research Council of Canada, and the Alberta Innovates Centre for
Machine Learning.

References 

Brogden, W. (1939). Sensory pre-conditioning. Journal of Experimental
Psychology 25(4):323-332.

Butz, M., Sigaud, O., Gerard, P., Eds. (2003). Anticipatory Behaviour in
Adaptive Learning Systems: Foundations, Theories, and Systems, LNAI 2684, Springer.

Carlsson, K., Petrovic, P., Skare, S., Petersson, K., Ingvar, M. (2000). Tickling
expectations: neural processing in anticipation of a sensory stimulus. Journal
of Cognitive Neuroscience 12(4):691-703.

Clark, A. (in press). Whatever Next? Predictive Brains, Situated Agents, and 
the Future of Cognitive Science. Behavioral and Brain Sciences. 

Dayan, P., Hinton, G. (1993). Feudal reinforcement learning. Advances in Neural 
Information Processing Systems 5, pp. 271-278. 

Gilbert, D. (2006). Stumbling on Happiness. Knopf Press. 

Grush, R. (2004). The emulation theory of representation: motor control,
imagery, and perception. Behavioral and Brain Sciences 27:377-442.

Hawkins, J., Blakeslee, S. (2004). On Intelligence. Times Books. 

Huron, D. (2006). Sweet anticipation: Music and the Psychology of Expectation. 
MIT Press. 

Kaelbling, L. (1993). Learning to achieve goals. In Proceedings of International 
Joint Conference on Artificial Intelligence. 

Levitin, D. (2006). This is Your Brain on Music. Dutton Books.



Pavlov, I. (1927). Conditioned Reflexes: An Investigation of the Physiological
Activity of the Cerebral Cortex, translated and edited by G. V. Anrep. Oxford
University Press.

Pezzulo, G. (2008). Coordinating with the future: The anticipatory nature of
representation. Minds and Machines 18(2):179-225.

Rescorla, R. (1980). Simultaneous and successive associations in sensory precon- 
ditioning. Journal of Experimental Psychology: Animal Behavior Processes 
6(3):207-216. 

Singh, S. (1992). Reinforcement learning with a hierarchy of abstract models. 
Proceedings of the Conference of the Association for the Advancement of Ar- 
tificial Intelligence (AAAI-92), pp. 202-207. 

Sutton, R. S. (1988). Learning to predict by the method of temporal differences.
Machine Learning 3:9-44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting 
based on approximating dynamic programming. Proceedings of the Seventh 
International Conference on Machine Learning, pp. 216-224. 

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. 
Proceedings of the International Conference on Machine Learning, pp. 531- 
539. 

Sutton, R. S. (2009). The grand challenge of predictive empirical abstract
knowledge. In: Working Notes of the IJCAI-09 Workshop on Grand Challenges for
Reasoning from Experiences.

Sutton, R. S., Barto, A. G. (1990). Time-derivative models of Pavlovian re- 
inforcement. In Learning and Computational Neuroscience: Foundations of 
Adaptive Networks, pp. 497-537. MIT Press. 

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. 
MIT Press. 

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Pre- 
cup, D. (2011). Horde: A scalable real-time architecture for learning knowledge 
from unsupervised sensorimotor interaction. Proceedings of the 10th Interna- 
tional Conference on Autonomous Agents and Multiagent Systems, pp. 761- 
768. 

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs:
A framework for temporal abstraction in reinforcement learning. Artificial
Intelligence 112:181-211.

Tolman, E. C. (1951). Purposive Behavior in Animals and Men. University of 
California Press. 

Wolpert, D., Ghahramani, Z., Jordan, M. (1995). An internal model for
sensorimotor integration. Science 269(5232):1880-1882.



